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This work presents a family of parsimonious Gaussian process models which allow 
to build, from a finite sample, a model-based classifier in an infinite dimensional 
space. The proposed parsimonious models are obtained by constraining the eigen- 
pj ' decomposition of the Gaussian processes modeling each class. This allows in 

particular to use non-linear mapping functions which project the observations into 
infinite dimensional spaces. It is also demonstrated that the building of the classifier 
^ ' can be directly done from the observation space through a kernel function. The 

proposed classification method is thus able to classify data of various types such 
as categorical data, functional data or networks. Furthermore, it is possible to 
classify mixed data by combining different kernels. The methodology is as well 



(N 

Q I extended to the unsupervised classification case. Experimental results on various 



data sets demonstrate the effectiveness of the proposed method. 



1 Introduction 



. Classification is an important and useful statistical tool in all scientific fields where decisions 

■ have to be made. Depending on the availability of a learning data set, two main situations 

■ ■ ■ may happen: supervised classification (also known as discriminant analysis) and unsupervised 

classification (also known as clustering). Discriminant analysis aims to build a classifier (or 
a decision rule) able to assign an observation x in an arbitrary space E with unknown class 
membership to one of k known classes Ci,...,Ck- For building this supervised classifier, 
a learning dataset {{xi, zi), {xn, Zn)} is used, where the observation xe e E and Zi G 
{1,...,A:} indicates the class belonging of the observation X£. In a slightly different context, 
clustering aims to directly partition an incomplete dataset {xi,...,Xn} into k homogeneous 
groups without any other information, i.e., assign to each observation x^ ^ E its group 
membership zi G Several intermediate situations exist, such as semi-supervised or 

weakly-supervised classifications [6J, but they will not be considered here. 
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Since the pioneer work of Fisher [10| . a huge number of supervised and unsupervised classifi- 
cation methods have been proposed in order to deal with different types of data. Indeed, there 
exist a wide variety of data such as quantitative, categorical and binary data but also texts, 
functions, sequences, images and more recently networks. As a practical example, biologists are 
frequently interested in classifying biological sequences (DNA sequences, protein sequences), 
natural language expressions (abstracts, gene mentioning), networks (gene interactions, gene 
co-expression), images (cell imaging, tissue classification) or structured data (gene structures, 
patient information). The observation space E can be therefore W if quantitative data are 
considered, L^([0, 1]) if functional data are considered (time series for example) or A^, where 
^ is a finite alphabet, if the data at hand are categorical (DNA sequences for example). 
Furthermore, the data to classify can be a mixture of different data types: categorical and 
quantitative data or categorical and network data for instance. 

Classification methods can be split into two main families: generative and discriminative 
techniques. On the one hand, generative techniques model the data of each class with a 
probability distribution and deduce the classification rule from this modeling. Model-based 
discriminant analysis assumes that {xi, x„} are independent realizations of a random vector 
X on E and that the class conditional distribution of X is parametric, i.e. f{x\z = i) = 
fi{x]9i). When E = R^, among the possible parametric distributions for fi, the Gaussian 
distribution is often preferred and, in this case, the marginal distribution of X is therefore a 
mixture of Gaussians: 

k 
i=l 

where c/) is the Gaussian density, tTj is the prior probability of the ith class, fii is the mean 
of the ith class and is its covariance matrix. In such a case, the optimal decision rule is 
called the maximum a posteriori (MAP) rule which assigns a new observation x to the class 
which has the largest posterior probability. Introducing the classification function Di{x) = 
log -|- (x — fiiYY,~^{x — fii) — 2 log(7rj), which can be rewritten as: 

Pi P 
Di{x) = ^ — < X - >|p -F^log(Aij) - 21og(7ri), (1) 

3=1 3=1 

where qij and Xij are respectively the j'th eigenvector and eigenvalue of Sj, it can be easily 
shown that the MAP rule reduces to finding the label i G {1, . . . , fc} for which Di{x) is the 
smallest. Estimation of model parameters is usually done by maximum likelihood. This method 
is known as the quadratic discriminant analysis (QDA), and, under the additional assumption 
that Si = S for all « € {1, ... , k}, it corresponds to the linear discriminant analysis (LDA). A 
detailed overview on this topic can be found in [l5]. Recent extensions allowing to deal with 
high-dimensional data include [T] [21 El [TBI [17] [201 [21] . Although model-based classification is 
usually enjoyed for its multiple advantages, model-based discriminant analysis methods have 
however two limiting characteristics. First, they are limited to quantitative data and cannot 
process for instance qualitative or functional data. Second, even in the case of quantitative 
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data, the Gaussian assumption may not be well-suited for the data at hand. 

On the other hand, discriminative techniques directly build the classification rule from the 
learning dataset. Among the discriminative classification methods, kernel methods [13] are 
probably the most popular and overcome some of the shortcomings of generative techniques. 
Kernel methods are non-parametric algorithm and can be applied to any data for which a 
kernel function can be defined. A kernel K : E x E ^ R \s a positive definite function such 
as every evaluation can be written as K(xi,Xj) =< (p{xi),ip{xj) >-^, with Xi,Xj £ E, ip a 
mapping function (called the feature map), T-L a finite or infinite dimensional reproducing kernel 
Hilbert space (the feature space) and < •, • the dot product in H. An advantage of using 
kernels is the possibility of computing the dot product in the feature space from the original 
input space without explicitly knowing 99 (kernel trick) [13]. Turning conventional learning 
algorithms into kernel learning algorithms can be easily done if the algorithms operate on the 
data only in terms of dot product. In particular, the kernel trick is used to transform linear 
algorithms to non-linear ones. Additionally, a nice property of kernel learning algorithms is the 
possibility to deal with any kind of data. The only condition is to be able to define a positive 
definite function over pairs of elements to be classified [13J. For instance, kernel functions 
can be defined on strings [271 Chap. 10 and 11], graphs [23] or trees [23 Chap. 5]. Many 
conventional linear algorithms have been turned to non-linear algorithms thanks to kernels |24| . 
For instance, a kernelized version of principal component analysis (KPCA) has been proposed 
in [25]. Mika et al. have also proposed kernel Fisher discriminant (KFD) as a non-linear 
version of FDA which only relies on kernel evaluations |19| . A kernelized Gaussian mixture 
model (KGMM) has been proposed in [9] for the supervised classification of hyperspectral data. 
But, due to computational considerations, the authors have introduced a strong assumption: 
the classes share the same covariance matrix in the feature space. However, the method still 
needs to be regularized. Recently, pseudo-inverse and ridge regularization have been proposed 
to define a kernel quadratic classifier where classes have their own covariance matrices \22] . 
In all these cases, a benefit is found by using the kernel version rather than the original 
algorithm. KPCA shows better results results than RCA in terms of reconstruction errors for 
image denoising |14] . Kernel GMM provides better accuracy than conventional GMM for the 
classification of hyperspectral images [9j. Let us however highlight that the kernel version 
involves the inversion of a kernel matrix, i.e., a n x n matrix estimated with only n samples. 
Usually, the kernel matrix is ill-conditioned and regularization is needed, while sometimes a 
simplified model is required too. Thus, it may limit the effectiveness of the kernel version. 
In addition, and conversely to model-based techniques, the classification results provided by 
kernel methods are unfortunately difficult to interpret which would be useful in many application 
domains. 

In this work, we propose to adapt model-based methods for the classification of any kind 
of data by working in a feature space of high or even infinite dimensional space. To this end, 
we propose a family of parsimonious Gaussian process models which allow to build, from a 
finite sample, a model-based classifier in a infinite dimensional space. It will be demonstrated 
that the building of the classifier can be directly done from the observation space through the 
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so-called "kernel trick". The proposed classification method will be thus able to classify data of 
various types (categorical data, mixed data, functional data, networks, ...). The methodology 
is as well extended to the unsupervised classification case (clustering). 

The paper is organized as follows. Section [2] presents the context of our study and introduces 
the family of parsimonious Gaussian process models. The inference aspects are addressed in 
Section O It is also demonstrated in this section that the proposed method can work directly 
from the observation space through a kernel. Sectionals dedicated to some special cases 
and to the extension to the unsupervised framework. Experimental comparisons with state- 
of-the-art kernel methods are presented in Section [5] as well as applications of the proposed 
methodologies to various types of data including functional, categorical, mixed and network 
data. Some concluding remarks are given in Section [6] and proofs are postponed to the 
appendix. 

2 Classification with parsimonious Gaussian process models 

in this section, we first define the context of our approach and exhibit the associated compu- 
tational problems. Then, a parsimonious parameterization of Gaussian processes is proposed 
in order to overcome the highlighted computational issues. 

2.1 Classification with Gaussian processes 

Let us consider a learning set {(xi, zi), (x„, where {xi,...,Xn} C E are assumed to 
be independent realizations of a, possibly non-quantitative and non-Gaussian, random variable 
X. The class labels {zi,...,Zn} are assumed to be realizations of a discrete random variable 
Z G {1, k}. It indicates the memberships of the learning data to the k classes denoted by 
Ci, . . . , Cfc, i.e., Z£ = i indicates that xe belongs to Cj. 

Let us assume that there exists a non-linear mapping if such that Y = ip{X) is, condition- 
ally on Z = i, a Gaussian process on [0,1] with mean /ij and continuous covariance function 
Si. More specifically, one has /ii(i) = E{Y{t)\Z = i) and Si(s,t) = {s)Y {t)\Z = 
i) — fj.i{t) fii{s) . It is then well-known [28] that, for all i = 1, . . . , k, there exist positive eigen- 
values (sorted in decreasing order) {Xij}j>i, together with eigenvector functions {qij{-)}j>i 
continuous on [0,1], such that 

oo 

Sj(s,t) = '^Xijqij{s)qij{t), 
i=i 

where the series is uniformly convergent on [0,1]^. Moreover, the eigenvector functions are 
orthonomal in -L^([0, 1]) for the dot product < f,g >L2= Jq f{t)g{t)dt. It is then easily seen, 
that, for all r > 1 and i G {1,... the random vector on W defined by {< Y,qij 
}j=x....,r is, conditionally on Z = i, Gaussian with mean {< fii,qij >}j=i,...,r and covariance 
matrix diag(Aii, . . . , Xir). To classify a new observation x, we therefore propose to apply the 
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Gaussian classification function ([T]) to ip{x): 

Di{ip{x)) = < ^{x) - fii,qij +^log(Aij) - 21og(7ri). 

From a theoretical point of view, if the Gaussian process is non degenerated, one should use 
r = +00. In practice, r has to be large in order not to loose to much information on the 
Gaussian process. Unfortunately, in this case, the above quantities cannot be estimated from 
a finite sample set. Indeed, only a part of the classification function can be actually computed 
from a finite sample set: 

Di{ip{x)) = E ^ < V'^^) ~ >L2 + '^^og{\ij) - 21og(7rj) 

i=i 3=1 

computable quantity 
r -j^ r 

+ ^ — < "Pi^) - >L2 + E log(Aij), 

j=ri+l i=n+l 

v ' 

non computable quantity 

where = min(nj,r) and rij = Card(Cj). Consequently, the Gaussian model cannot be used 
directly in the feature space to classify data if r > rij for i = I, k. 

2.2 A parsimonious Gaussian process model 

To overcome the computation problem highlighted above, it is proposed here to use in the 
feature space a parsimonious model for the Gaussian process modeling each class. Following 
the idea of [3J, we constrain the eigen-decomposition of the Gaussian processes as follows. 

Definition 1. A parsimonious Gaussian process model is a Gaussian process Y for which, 
conditionally to Z = i, the eigen-decomposition of its covariance operator Sj is such that: 

(Al) it exists a dimension r < +00 such that Xij = for j > r and for all i = I, k, 

(A2) it exists a dimension di < min{r, ni} such that Xij = X for di < j < r and for all 

i = 1, k. 

It is worth noticing that r can be as large as it is desired, as long it is finite, and in 
particular r can be much larger than n^, for any i = l,...,k. From a practical point of 
view, this modeling can be viewed as assuming that the data of each class live in a specific 
subspace of the feature space. The variance of the actual data of the ith group is modeled by 
the parameters Aji,..., Ajd, and the variance of the noise is modeled by A. This assumption 
amounts to supposing that the noise is homoscedastic and its variance is common to all the 
classes. The dimension di can be considered as well as the intrinsic dimension of the latent 
subspace of the ith group in the feature space. This model is referred to by pgpA^o (or Aio 
for short) hereafter. With these assumptions, we have the following result. 
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Model 


Variance inside 
the subspace r-i 


Variance 
outside 


Subspace 
orientation Qi 


Intrinsic 
dimension di 


Mo 


Free 


Common 


Free 


Free 


Ml 


Free 


Common 


Free 


Common 


M2 


Common within groups 


Common 


Free 


Free 


Ms 


Common within groups 


Common 


Free 


Common 


Mi 


Common between groups 


Common 


Free 


Common 


M5 


Common within and between groups 


Common 


Free 


Free 


Me 


Common within and between groups 


Common 


Free 


Common 


M7 


Common between groups 


Common 


Common 


Common 


Ms 


Common within and between groups 


Common 


Common 


Common 



Table 1: List of the submodels of the parsimonious Gaussian process model (referred to by Mo 
here). 

Proposition 1. Letting d^iax = max((ii, d^), the classification function Di can be written 
as follows in the case of a parsimonious Gaussian process model pgpM. : 

( I l\ 1 

di 

+ log(^u) + (^max - di) log(A) - 21og(7ri) + 7, (2) 

i=i 

where 7 is a constant term which does not depend on the index i of the class. 

At this point, it is important to notice that the classification function Di depends only 
on the eigenvectors associated with the di largest eigenvalues of Sj. This estimation is now 
possible due to the inequality di < Ui for i = l,...,k. Furthermore, the computation of the 
classification function does not depend any more on the parameter r. As shown in the next 
section, it is possible to reformulate the classification function such that it does not depend 
either on the mapping function ip. 

2.3 Submodels of the parsimonious model 

By fixing some parameters to be common within or between classes, it is possible to obtain 
particular models which correspond to different regularizations. Table [T] presents the 8 ad- 
ditional models which can be obtained by constraining the parameters of model Mq. For 
instance, fixing the dimensions di to be common between the classes yields the model M-i. 
Similarly, fixing the first di eigenvalues to be common within each class, we obtain the more 
restricted model Ai2- It is also possible to constrain the first di eigenvalues to be common 
between the classes (models 7W4 and M-7), and within and between the classes (models M-5, 
Me and A^s)- This family of 9 parsimonious models should allow the proposed classification 
method to fit into various situations. 
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3 Model inference and classification with a kernel 



This section focuses on the inference of the parsimonious models proposed above and on the 
classification of new observations through a kernel. Model inference is only presented for the 
model Mo since inference for the other parsimonious models is similar. Estimation of intrinsic 
dimensions and visualization in the feature subspaces are also discussed. 

3.1 Estimation of model parameters 

In the model-based classification context, parameters are usually estimated by their empirical 



counterparts [15| which conduces, in the present case, to estimate the proportions tTj by 
TTi = rii/n and the mean function by fii{t) = — ^ ip{xj){t). Regarding the covariance 

Xj G Cj 

operator, the eigenvalue Xij and the eigenvector qij are respectively estimated by the jth 
largest eigenvalue Xij and its associated eigenvector function qij of the empirical covariance 
operator Sj: 

ti{s,t) = — V ip{xe){s)ip{xe){t) - ili{s)(ii{t). 
Finally, the estimator of A is: 

1 ^ ( -~ 

I trace(Si) - ^ Xij j . (3) 



Eti (r - di) ^ 



Using the plug-in method, the estimated classification function Di can be written as follows: 

* / 1 l\ 1 

Di{<f{x)) = < fix) - fit,qij +j\Mx) - fj-iWi^ 

j=l \Xij X) X 

di 

+ ^log(Aij) + ((imax-(ii)log(A)-21og(7ri). (4) 
i=i 

However, as we can see, the estimated classification function Di still depends on the func- 
tion Lp and therefore requires computations in the feature space. However, since all these 
computations involve dot products, it will be shown in the next paragraph that the estimated 
classification function can be computed without explicit knowledge of 99 through a kernel 
function. 

3.2 Estimation of the classification function through a kernel 

Kernel methods are all based on the so-called "kernel trick" which allows the computation 
of the classifier in the observation space through a kernel K. Let us therefore introduce the 
kernel K -.E defined as K{x^y) =< (p{x),if{y) and pi : E x E ^ M. defined 

as pi{x,y) =< (p{x) — fii,ip{y) — fii >l2- In following, it is shown that the classification 
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Kernels 


K{x,y) 




Linear 
Gaussian 

Polynomial 


(< x,y >L2 +1)'' 


min(ni,p) 

m 

mill [ui, 



Table 2: Dimension for several kernels. 



function Di only involves pi which can be computed using K: 

Pi{x,y) = \ V < ip{x) - Lp{xi),ip{y) - Lp{xi:) (5) 
= K{x,y) - — "Y {K{xe,y) + K{x,xe)) + ^ (6) 

For each class Ci, let us introduce the x symmetric matrix Mi defined by: 

pi{xi,xi:) 

Ui 

With these notations, we have the following result. 

Proposition 2. For i = I, . . . ,k, the estimated classification function can be computed, in 
the case of the model Mq, as follows: 

Di{<f{x)) = \J- -j] I (3tjfMx,xe)\ +lp,{x,x) 

di 

+ ^ log(Aij) + (dniax - di) log(A) - 21og(7ri), 
i=i 

where, for j = 1, . . . ,di, /3ij is the normed eigenvector associated to the jth largest eigenvalue 
Xij ofM, and X = 1/Ei=i - di) x ^^-^^ rn (trace(M,) - EjLi >^ij) ■ 

It thus appears that each new sample point x can be assigned to the class Ci with the 
smallest value of the classification function without knowledge of ip. The methodology based 
on Proposition [2] is referred to pgpDA in the sequel. In practice, the value of r,; depends on 
the chosen kernel (see Table |2]for examples). 



3.3 Intrinsic dimension estimation 

The estimation of the intrinsic dimension of a dataset is a difficult problem which occurs 
frequently in data analysis, such as in principal component analysis. A classical solution in 
PCA is to look for a break in the eigenvalue scree of the covariance matrix. This strategy 
relies on the fact that the jth eigenvalue of the covariance matrix corresponds to the fraction 
of the full variance carried by the jth eigenvector of this matrix. Since, in our case, the class 
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conditional matrix Mi shares with the empirical covariance operator of the associated class 
its largest eigenvalues, we propose to use a similar strategy based on the eigenvalue scree of 
the matrices Mj to estimate di, i = l,...,k. More precisely, we propose to make use of the 
scree-test of Cattell [5j for estimating the class specific dimension di, i = l,...,k. For each 
class, the selected dimension is the one for which the subsequent eigenvalues differences are 
smaller than a threshold which can be tuned by cross-validation for instance. 

3.4 Visualization in the feature subspaces 

An interesting advantage of the approach is to allow the visualization of the data in subspaces 
of the feature space. Indeed, even though the chosen mapping function is associated with a 
space of very high or infinite dimension, the proposed methodology models and classifies the 
data in low-dimensional subspaces of the feature space. It is therefore possible to visualize the 
projection of the mapped data on the feature subspaces of each class using Equation (ill!) of 
the appendix. The projection of on the jth axis of the class Cj is therefore given by: 



Thus, even if the observations are non quantitative, it is possible to visualize their projections 
in the feature subspaces of the classes which are quantitative spaces. 

4 Particular cases and extension to clustering 

The methodology proposed in the previous section is made very general by the large choice 
for the mapping function ip{x). We focus in this section on two specific choices for ip{x) for 
which the direct calculation of the classification rule is possible. An extension to unsupervised 
classification is also considered through the use of an EM algorithm. 

4.1 Case of the linear kernel for quantitative data 

In the case of quantitative data, E = and one can choose ip{x) = x associated to 
the standard scalar product which gives rise to the linear kernel K{x,y) = x^y. In such a 
framework, the estimated classification function can be simplified as follows: 

Proposition 3. If E = MP and K{x, y) = x*y then, for i = 1, . . . , k, the estimated classifica- 
tion function reduces to 




1 



A(x) = y: 



di 




{q\j{x-fii)) + 




+ ^log(Aij) + {d 

1 



max 



(i,)log(A) -21og(^i). 
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where fii is the empirical mean of the class Ci, qij is the eigenvector of the empirical covariance 
matrix Ej associated to the jth largest eigenvalue Xij and X is given by (Qj- 

It appears that the estimated classification function reduces to the one of the HDDA 
method [3j with the model [aijbQid] which has constraints similar to M-q. Therefore, the 
methodology proposed in this work partially encompasses the method HDDA. 

4.2 Case of functional data 

Let us consider now functional data observed in £' = L^([0, 1]). Let (fej)j>i be a basis of 
L^([0, 1]) and F = where L is a given integer. For all I = 1, . . . , L, the projection of a 
function x on the jth basis function is computed as 

7j(x) = / x{t)bj{t)dt 
Jo 

and 7(x) := (7j(x))j=i^...^L. Let B the L x L Gram matrix associated to the basis: 

Bje= [ bj{t)be{t)dt, 
Jo 

and consider the associated scalar product defined by < u,v >= u^Bv for all u,v ^ M^. One 
can then choose ip{x) = B^^j{x) and K{x,y) = 7(x)*i?^^7(y) leading to a simple estimated 
classification function. 

Proposition 4. Let E = i>^([0, 1]) and K{x, y) = j{xYB^^j{y). Introduce, for i = 1, . . . ,k, 
the L X L covariance matrix of the 7(xj) when xj G Ci: 

ti = — hi^i) - 7i)(7(2;<?) - liY where 7^ = — ^ j{xj) 
Then, for i = I, . . . ,k, the estimated classification function reduces to 

A(^(x)) =J2(^-\] fe(7(^) - 7.))' + i(7(x) - liYB-^^ix) - 7.) 
j=l \Xij X) X 

di 

+ ^ log(Aij) + (dmax - di) log(A) - 21og(7ri), 

i=i 

where qij and Xij are respectively the jth normed eigenvector and eigenvalue of the matrix 
and X is given by 

Remark that B^^Tji coincides with the matrix of interest in functional PCA [531 Chap. 8.4] 
and that, if the basis is orthogonal, then B is the identity matrix. Notice that the proposed 
method therefore encompasses as well the model proposed in [4j for the clustering of functional 
data. 
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4.3 Extension to unsupervised classification 

Since the previous section has demonstrated the possibility to use the Gaussian classification 
function in the feature space, it is also possible to extend its use to unsupervised classification 
(also known as clustering). Indeed, in the model-based classification context, the unsuper- 
vised and supervised cases mainly differ in the manner to estimate the parameters of the 
model. The clustering task aims to form k homogeneous groups from a set of n observations 
{xi, Xn} without any prior information about their group memberships. Since the labels 
are not available, it is not possible in this case to directly estimate the model parameters. 
In such a context, the expectation-maximization (EM) algorithm [8J is frequently used. As a 
consequence, the use of the EM algorithm allows to both estimate the model parameters and 
predict the class memberships of the observations at hand. In the case of the parsimonious 
model M-o introduced above, the EM algorithm takes the following form: 

The E step This first step reduces, at iteration q, to the computation of = K{Zj = 



i\x 



('^ ^^), for j = 1, . . . , n and i = 1, . . . ,k, conditionally on the current value of the model 
parameter 9^'^^^^: 

tg) = l/^exp {Dt'\v{x,)) - Dt'Hvi^j))) , (7) 
i=i 

where 



+ + Y. l°g(^ir") + - "iijloglA"-") - 21og(*J«-"). 

is the Gaussian classification function associated with the model parameters estimated in the 
M step at iteration q — l. This result can be proved by substituting Equation (jlOl) in the proof 
of Proposition [2] by: 

Qij = PijeVteiivixi) - fii). (8) 

niXij xieCi 

The M step This second step estimates the model parameters conditionally on the posterior 
probabilities tjj^ computed in the previous step. In practice, this step reduces to update the 
estimate of model parameters according to the following formula: 

■ mixture proportions are estimated by vr,-'''' = nf' jn where nf' = J2'j=itif' 

■ parameters Xij, A, /3ij and di are estimated at iteration q using the formula given in 
Proposition [2] but where the matrix Mj is now a n x n matrix, recomputed at each 
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iteration q, and such that, for i = 1, /c and ^, = 1, ... 



n: 





pf'{xt,xej) 



e,e' 



where {xi,X£i) can be computed through the kernel K as follows: 



Pi \xi,,xe) =K{xi,xe) {K{xj,X(,) + K{xfj,Xj)) 



■i i=i 



1 " 



The clustering algorithm associated with this methodology will be denoted to by pgpEM in 
the following. 

5 Benchmark study and applications to non-quantitative data 

In this section, numeral experiments and comparisons are conducted on real-world data sets 
to highlight the main features of the pgpDA and pgpEM methods. 

5.1 Benchmark study on quantitative data 

We focus here on the comparison of pgpDA with state-of-the-art methods. To that end, 
two kernel generative classifiers are considered, kernel Fisher discriminant analysis (KFD) jl9| 
and kernel QDA (KQDA) [9J, and one kernel discriminative classifier, support vector machine 
(SVM) [24j. The Gaussian kernel is used once again in the experiments for all methods, 
including pgpDA. Since real-world problems are considered, all the hyper-parameters of the 
classifiers have been tuned using 5-fold cross-validation. 

Six data sets from the UCI Machine Learning Repository {http://archive.ics.uci.edu/ml/) 
have been selected: glass, ionosphere, iris, sonar, USPS and wine. We selected these data 
sets because they represent a wide range of situations in term of number of observations n, 
number of variables p and number of groups k. The USPS dataset has been modified to focus 
on discriminating the three most difficult classes to classify, namely the classes of the digits 
3, 5 and 8. This dataset has been called USPS 358. The main feature of the data sets are 
described in Table [3l 

Each data set was randomly split into training and testing sets in the hold-out ratio hr 
given in Table O The data were scaled between -1 and 1 on each variable. The search 
range for the cross-validation was for the kernel hyperparameter a G [—4,4], for the common 
intrinsic dimension d G [1,20], for the scree test threshold r G [10"'', 1], for the regularization 
parameter in KFD and KQDA A G [10"^^, 10"^] and for the penalty parameter of the SVM 
7 G [2^,2^]. The global classification accuracy was computed on the testing set and the 
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Dataset 


n 


P 


n/p 


k 


hr 


Iris 


150 


4 


37.5 


3 


0.5 


Glass 


214 


9 


23.7 


6 


0.25 


Wine 


178 


13 


13.7 


3 


0.5 


Ionosphere 


351 


34 


10.3 


2 


0.5 


Sonar 


208 


60 


3.5 


2 


0.5 


USPS 358 


2248 


256 


8.8 


3 


0.5 



Table 3: Data used in the experiments, n is the number of samples, p is the number of features, 
k is the number of classes and hr is the hold-out ratio used in the experiments. 

reported results have been averaged over 50 replications of the whole process. The average 
classification accuracies and the standard deviations are given in TablelU 

Regarding the competitive methods, KFD and SVM provide often better results than KQDA. 
The model used in KQDA only fits "ionosphere", "iris" and "wine" data, for which classification 
accuracies are similar to or better than those obtain with KFD and SVM. For the parsimonious 
pgpDA models, except for My and M-s, the classification accuracies are globally good. Models 
Ail and Ai^ provide the best results in terms of average correct classification rates. In 
particular, for the "USPS 358" and "wine" data sets, they provide the best overall accuracies. 
Let us remark that pgpDA performs significantly better than SVM (for the Gaussian kernel) 
on high-dimensional data (USPS 358). 

In conclusion of these experiments, by relying on parsimonious models rather than regular- 
ization, pgpDA provides good classification accuracies and it is robust to the situation where 
few samples are available in regards to the number of variables in the original space. In prac- 
tice, models Aii and A4i should be recommended: intrinsic dimension is common between 
the classes and the variance inside the class intrinsic subspace is either free or common. Con- 
versely, models M-r and Ms must be avoided since they appeared to be too constrained to 
handle real classification situations. 

5.2 Classification of functional data: the Canadian temperatures 

We now focus on illustrating the possible range of application of the proposed methodologies 
to different types of data. We consider here the clustering of functional data with pgpEM for 
which the mapping function ip is explicit (see Section 02]). The Canadian temperature data 
used in this study, presented in details in |23| . consist in the daily measured temperatures at 
35 Canadian weather stations across the country. The pgpEM algorithm was applied here with 
the model Mo, which is the most general parsimonious Gaussian process model proposed in 
this work, with a fixed number k of groups set to 4. The mapping function if consists in the 
projection of the observed curves on a basis of 20 natural cubic splines. Once the pgpEM 
algorithm has converged, various informations are available and some of them are of particular 
interest. Group means, intrinsic dimensions of the group-specific subspaces and functional 
principal components of each group could in particular help the practitioner in understanding 
the clustering of the dataset at hand. The left panel of Figu re [T] presents the clustering of the 
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Method 


Iris 


Glass 


Wine 


Ionosphere 


Sonar 


USPS 358 


Mean (rank) 


pgpDA Mo 


95.9± 2.1 


64.9 ± 6.3 


96.8 ± 1.7 


90.5 ± 2.3 


77.9 ± .9 


92.2 ± 1.0 


86.4 (5) 


pgpDA Ml 


95.2± 2.1 


62.6 ± 12.5 


96.7 ± 2.3 


93.7 ± 1.6 


81.8 ± 4.9 


96.6 ± 0.4 


87.8 (2) 


pgpDA M2 


94.4± 6.2 


64.4 ± 6.7 


96.8 ± 1.8 


91.0 ± 2.8 


71.6 ± 13.4 


95.4 ± 0.8 


85.6 (7) 


pgpDA Ms 


95.8± 2.3 


64.3 ± 6.8 


96.9 ± 2.0 


93.2 ± 2.1 


79.3 ± 4.9 


96.2 ± 0.5 


87.6 (3) 


pgpDA Ma 


94.4± 2.2 


65.3 ± 6.4 




93.4 ± 2.0 


81.6 ± 4.5 


96.3 ± 0.7 




pgpDA M5 


94. 2± 7.1 


59.8 ± 10.9 


96.4 ± 2.0 


92.0 ± 1.8 


72.5 ± 12.6 


96.0 ± 0.5 


85.2 (8) 


pgpDA 7^6 


94.8± 2.1 


65.2 ± 5.6 


97.2 ±1.8 


92.5 ± 2.1 


79.8 ± 4.9 


96.1 ± 0.5 


87.6 (3) 


pgpDA M7 


41.3± 16.5 


40.0 ± 5.4 


75.2 ± 8.3 


64.6 ± 2.6 


48.8 ± 5.7 


63.5 ± 1.5 


55.5 (11) 


pgpDA Ats 


29.2± 17.4 


35.4 ± 7.9 


64.2 ± 26.8 


64.3 ± 2.5 


50.5 ± 5.5 


36.8 ± 1.2 


46.7 (12) 


KFD 


93.4± 3.7 


47.3 ± 10.1 


95.9 ± 2.3 


94.1 ± 1.7 


82.9 ± 3.1 


93.6 ± 0.5 


84.5 (9) 


KQDA 


96.6± 2.3 


64.5 ± 6.3 


96.6 ± 1.7 


88.1 ± 2.3 


68.9±18.1 


64.7 ± 37.5 


79.9 (10) 


SVM 


95.7± 2.0 


69.1 ± 5.5 


96.8 ± 1.4 


92.8 ± 1.8 


84.8 ± 4.0 


77.6 ± 5.4 


86.1 (6) 



Table 4: Classification results on real-world datasets: reported values are average correct classification rates and standard deviation computed on 
validation sets. 



o 





Figure 1: Clustering of the 35 times series of the Canadian temperature data set into 4 groups 
with pgpEM (left) and geographical positions of the weather stations according to 
their group belonging (right). The colors indicate the group memberships: group 1 
in black, group 2 in red, group 3 in green and group 4 in blue. 

temperature data set into 4 groups with pgpEM. 

It is first interesting to have a look at the name of the weather stations gathered in the 
different groups formed by pgpEM. It appears that group 1 (black solid curves) is mostly 
made of continental stations, group 2 (red dashed curves) mostly gathers the stations of 
the North of Canada, group 3 (green dotted curves) mostly contains the stations of the 
Atlantic coast whereas the Pacific stations are mostly gathered in group 4 (blue dot-dashed 
curves). For instance, group 3 contains stations such as Halifax (Nova Scotia) and St Johns 
(Newfoundland) whereas group 4 has stations such as Vancouver and Victoria (both in British 
Columbia). The right panel of Figure[T] provides a map of the weather stations where the colors 
indicate their group membership. This figure shows that the obtained clustering with pgpEM 
is very satisfying and rather coherent with the actual geographical positions of the stations 
(the clustering accuracy is 71% here compared with the geographical classification provided by 
[23|). We recall that the geographical positions of the stations have not been used by pgpEM 
to provide the partition into 4 groups. 

An important characteristic of the groups, but not necessarily easy to visualize, is the specific 
functional subspace of each group. A classical way to observe principal component functions 
is to plot the group mean function jli{t) as well as the functions fii{t) ±2\J\ijqij{t) (see [23] 
for more details). Figure [2] shows such a plot for the 4 groups of weather stations formed by 
pgpEM. It first appears on the first functional principal component of each group that there 
is more variance between the weather stations in winter than in summer. In particular, the 
first principal component of group 4 (blue curves, mostly Pacific stations) reveals a specific 
phenomenon which occurs at the beginning and the end of the winter. Indeed, we can observe 
a high variance in the temperatures of the Pacific coast stations at these periods of time 
which can be explained by the presence of mountain stations in this group. The analysis of 
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PCA function 1 (Percentage of variability 85.1 ) 



PC A function 2 (Percentage of variability 17.6 ) 
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(a) Group 1 (mostly continental stations) 

PCA function 1 (Percentage of variability 98.2 ) PCA function 2 (Percentage of variability 2.4 ) 
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(b) Group 2 (mostly Arctic stations) 



PCAfunctionl (Percentage of variabiiity 91.1 ) 
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PCA function 2 (Percentage of variability 6.5 ) 
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(c) Group 3 (mostly Atlantic stations) 



PCA function 1 (Percentage of variability 97.5 ) 



PCA function 2 (Percentage of variability 12.5 ) 
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(d) Group 4 (mostly Pacific stations) 

Figure 2: The group means of the Canadian temperature data obtained with pgpEM and the 
effects of adding (+) and subtracting (aCS) twice the square root of the feature 
subspace variance (see text for details). 
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Figure 3: Regularized Laplacian kernel associated to the Add Health network for v = A: blue 
pixels correspond to low values (low similarity between nodes) and red pixels corre- 
spond to high values (high similarity between nodes). 

the second principal components reveals finer phenomena. For instance, the second principal 
component of group 1 (black curves, mostly continental stations) shows a slight shift between 
the + and alS along the year which indicates a time-shift effect. This may mean that some 
cities of this group have their seasons shifted, e.g. late entry and exit in the winter. Similarly, 
the inversion of the -|- and alS on the second principal component of the Pacific and Atlantic 
groups (blue and green curves) suggests that, for these groups, the coldest cities in winter are 
also the warmest cities in summer. On the second principal component of group 2 (red curves, 
mostly Arctic stations), the fact that the -|- and aLS curves are almost superimposed shows 
that the North stations have very similar temperature variations (different temperature means 
but same amplitude) along the year. 

5.3 Classification of networks: the Add Health dataset 

We now consider network data which are nowadays widely used to represent relationships 
between persons in organizations or communities. Recently, the need of classifying and visual- 
izing such data has suddenly grown due to the emergence of Internet and of a large number of 
social network websites. Indeed, increasingly, it is becoming possible to observe aAlJnetwork 
informationsaAi in a variety of contexts, such as email transactions, connectivity of web pages, 
protein-protein interactions and social networking. A number of scientific goals can apply to 
such networks, ranging from unsupervised problems such as describing network structure, to 
supervised problems such as predicting node labels with information on their relationships. 

We investigate here the use of pgpDA to classify the nodes of a network. To our knowledge, 
only a few kernels (see for more details) have been proposed for network data and the 
regularized Laplacian kernel is probably the most used. This kernel is defined as follows: let 
X be a symmetric n x n socio-matrix where Xij = 1 if a relationship is observed between the 
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(a) Subspace of class 2 



-0.2 -0.15 




(b) Subspace of class 4 



Figure 4: Visualisation of the Add Health network with pgpDA in the feature subspace of the 
2nd and the 4th class (grade 8 and 10 respectively). 



nodes i and j and Xij = in the opposite case. Let D be the diagonal matrix where Da 
indicates the number of relationships for the node i, i.e., Da = J2]=i^ij- The regularized 
Laplacian kernel K is then defined by: 



K 



L + i^I„ 



where L = I„ — D~2XD~2 is the normalized Laplacian of the network, is a positive value 
and In is the identity matrix of size n. 

The social network studied here is from the National Longitudinal Study of Adolescent Health 
and it is a part of a big dataset, usually called the aAlJAdd HealthaAi dataset. The data were 
collected in 1994-95 within 80 high-schools and 52 middle schools in the USA. The whole study 
is detailed in [12]. In addition to personal and social information, each student was asked to 
nominate his best friends. We consider here the social network based on the answers of 67 
students from a single school, treating the grade of each student as the class variable. Two 
adolescents who nominated nobody were removed from the network. We therefore consider a 
whole dataset made of 65 students distributed into 5 classes: grade 7 to grade 11. 

We first selected by cross-validation the kernel parameter on a learning sample and the 
threshold parameter for the intrinsic dimensions was set to 0.2. The most adapted value for 
V was 4 and this gives on average 96.92% of correct classification for the test nodes. Remark 
that u turned out not to be a sensitive parameter and we obtain satisfying results for a large 
range of values of i^. Figure[3] presents the kernel associated with the selected value of z^. Since 
network visualization is an important issue in network analysis, we then kept these parameters 
to visualize the whole network in the feature subspace of each class. Figure |4] presents the 
visualization of the network into the feature subspace of the classes 2 and 4. Both visualizations 
turn out to be very informative and, in particular, the visualization on the feature subspace 
of the 4th class (grade 10) is particularly useful to understand the network. It is interesting 
to notice that the network is almost organized along a 1-dimensional manifold (an half-circle 



18 



50 100 150 200 250 300 350 400 

Senators 

Figure 5: Votes (yea, nay or unknown) for each of the U.S. House of Representatives con- 
gressmen on 16 key votes in 1984. Yeas are in indicated in white, nays in gray and 
missing values in black. The first 168 congressmen are republicans whereas the 267 
last ones are democrats. 

here) which is consistent with the nature of the network: students of different classes. The 
specific form of the representation is due here to some relations between students of grade 
7 and 10 (students of the same family perhaps). We also remark that the classes are quite 
well separated and most of the relationships between students of different classes are between 
consecutive grades. This suggests that relationships between classes are due to students who 
failed to move to the upper grade and who may keep contact with old friends. It is in addition 
interesting to notice that this visualization is very close to the one obtained on the same 
network by Hoff, Handcock and Raftery in [II] using the so-called "latent space model". 

5.4 Classification of categoretical data: the house-vote dataset 

We focus now on categorical data which are also very frequent in scientific fields. We consider 
here the task of clustering (unsupervised classification) and therefore the pgpEM algorithm. 
To evaluate the ability of pgpEM to classify categorical data, we used the U.S. House Votes 
data set from the UCI repository. This data set is a record of the votes (yea, nay or unknown) 
for each of the U.S. House of Representatives congressmen on 16 key votes in 1984. These 
data were recorded during the during the third and fourth years of Ronald Reagan's Presidency. 
At this time, the republicans controlled the Senate, while the democrats controlled the House 
of Representatives. Figure |5] shows the database where yeas are in indicated in white, nays 
in gray and missing values in black. The first 168 congressmen are republicans whereas the 
267 last ones are democrats. As we can see, the considered votes are very discriminative since 
republicans and democrats vote differently in almost all cases while most of the congressmen 
follow the majority vote in their group. We can however notice that a significant part (around 
50 congressmen) of the democrats tend to vote differently from the other democrats. 

To cluster this dataset, we first build a kernel from the categorical observations (16 qualita- 
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Figure 6: Kernel based on the Hamming distance (left) computed on the house-vote dataset 
and clustering results (right) obtained with pgpEM. For the kernel, blue and red pixels 
correspond respectively to low and high values. The clustering results are presented 
through a binary matrix where a black pixel indicates a common membership between 
two senators. 

tive variables with 3 possible values: yea, nay or ?). We chose a kernel, proposed in [7j, based 
on the Hamming distance which measures the minimum number of substitutions required to 
change one observation into another one. Figure [6] presents the resulting kernel (left panel) 
and the clustering result obtained with the pgpEM algorithm. The clustering results are pre- 
sented through a binary matrix where a black pixel indicates a common membership between 
two senators and a white pixel means different memberships for the two senators. The pgpEM 
algorithm was used with the model A^o. with a number of group equals to 2 and the Cattell's 
threshold was set to 0.2. The clustering accuracy between the obtained partition of the data 
and the democrat/republican partition was 84.37% on this example. As one can observe, the 
pgpEM algorithm globally succeeds in recovering the partition of the House of Representa- 
tives. It is also interesting to notice that most of the congressmen which are not correctly 
classified are those who tend to vote differently from the majority vote in their group. Finally, 
the PgpEM algorithm allows to visualize the observed categorical data into the (quantitative) 
feature subspace of the two groups. Figure [7] presents these visualizations. The observation of 
these two plots confirms the fact that republicans voted more homogeneously than democrats 
in 1984 since there is no clear concentration of points on both plots for the democrats. 

5.5 Classification of mixed data: the Thyroid dataset 

In this final experiment, we consider the supervised classification of mixed data which is more 
and more a frequent case. Indeed, it is usual to collect for the same individuals both quantitative 
and categorical data. For instance, in Medicine, several quantitative features can be measured 
for a patient (blood test results, blood pressure, morphological characteristics, ...) and these 
data can be completed by answers of the patient on its general health conditions (pregnancy, 
surgery, tabacco, ...). The Thyroid dataset considered here is from the UCI repository and 
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Figure 7: Visualization of the house-vote data in the feature subspace of the republican (left) 
and the democrat (right) groups (red crosses denote republicans and blue circles 
denote democrats). 
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Figure 8: Quantitative (left) and categorical (center) data kernels and the combined kernel 
(right) for the Thyroid dataset (mixed data). 



contains thyroid disease records supplied by the Caravan Institute, Sydney, Australia. The 
dataset contains 665 records on male patients for which the answers (true of false) on 14 
questions have been collected as well as 6 blood test results (quantitative measures). Among 
the 665 patients of the study, 61 suffer from a thyroid disease. 

To make pgpDA able to deal with such data, we built a combined kernel by mixing a kernel 
based on the Hamming distance [7j (same kernel as in the previous section) for the categorical 
features and a Caussian kernel for the quantitative data. We chose to combine both kernels 
simply as follows: 

K{xj,xe) = aKi{xj,Xi) + (1 - a)K2ixj,Xi), 

where Ki and K2 are the kernels computed respectively on the categorical and quantitative 
features. Another solution would be to multiply both kernels. We refer to [18J for further 
details on multiple kernel learning. 

We selected the optimal set of kernel parameters by cross-validation on a learning part of 
the data. The model for pgpDA was the model M-o with the Cattell's threshold set to 0.2. 
The mixing parameter a for kernels was set to 0.5 in order not to favor any kernel but it is 
expected an improvement of the results if this parameter is tuned too. Kernel parameters have 
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Method 


pgpDA on 
quantitative data 


pgpDA on 
categorical data 


pgpDA with the 
combined kernel 


TP rate 


74.86 


96.00 


75.88 


FP rate 


22.16 


95.53 


21.97 



Table 5: Classification results on test sets for the Thyroid dataset (mixed data). Results are 
averaged on 25 replications of the experiment. 

been tuned by cross-validation on a learning sample and the kernels associated to these values 
are presented in Figure [51 The rows and columns of the matrices are sorted according to the 
class memberships (healthy or sick) and the sick patients are the last ones. We then compared 
the performance of pgpDA with the combined kernel to pgpDA with, on the one hand, a 
simple RBF kernel built only on the quantitative variables of the dataset and, on the other 
hand, a Hamming kernel built only on the categorical variables. Table[5] presents both the true 
positive (TP) and false positive (FP) rates obtained on 25 replications of the classification 
experiment for pgpDA on quantitative data, on categorical data and on the mixed data. It 
turns out that quantitative data contains most of the important information to discriminate 
the patients with thyroid diseases and that categorical data, when considered alone, are not 
enough to build an efficient classifier. However, it appears that the use of the categorical 
features in combination with the quantitative data allows to slightly improve the prediction of 
thyroid diseases (increases the TP rate and decreases the FP rate). In particular, the reduction 
of the FP rate is important here since it implies an important reduction of the number of false 
alarms. 

6 Conclusion 

This work has introduced a family of parsimonious Gaussian process models for the super- 
vised and unsupervised classification of quantitative and non-quantitative data. The proposed 
parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian 
processes modeling each class. They allow in particular to use non-linear mapping functions 
which project the observations into an infinite dimensional space and to build, from a finite 
sample, a model-based classifier in this space. It has been also demonstrated that the building 
of the classifier can be directly done from the observation space through a kernel, avoiding the 
explicit knowledge of the mapping function. It has been possible to classify data of various 
nature including categorical data, functional data, networks and even mixed data by combining 
different kernels. The methodology is as well extended to the unsupervised classification case. 
Numerical experiments on benchmark data sets have shown that pgpDA performs similarly or 
better compared to the best kernel methods of the state of the art. The possibility to examine 
the model parameters and to visualize the data into the class-specific feature subspaces per- 
mits a finer interpretation of the results than with conventional discriminative kernel methods. 
Among the possible extensions of this work, it would be interesting to extend the methodology 
to the semi-supervised case in which only a few observations are labeled. 
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Appendix: Proofs 

Proof of Proposition [I] Recalling that dmax = max((ii, dfc), the classification function 
can be rewritten as: 



Di{(p{x)) =J2^ < '^(x) - >L2 log(A) - 21og(7ri) + 7, 



di 

E 



where 7 = (r — dmax) log(A) is a constant term which does not depend on the index i of the 
class. In view of the assumptions, Di{ip{x)) can be also rewritten as: 

di 1 ^ 



til 

+ log(-^ii) + (*^max " di) log(A) - 2 log(7ri) + 



7- 



Introducing the norm ||.||l2 associated with the scalar product < .,. and in view of 
Proposition 1 of p. 208], we finally obtain: 

+ ^log(Aij) + (dmax - (ij)log(A) - 21og(7ri) +7, 

which is the desired result. □ 
Proof of Proposition [2] The proof involves three steps. 

i) Computation of the projection < ip{x) — fli,qij >l2 '■ Since {Xij,qij) is solution of the 
Fredholm-type equation, it follows that, for all t G [0,1], 

>'ijQij{t) = / ti{s,t)qij{s)ds 
JO 

= — E < 'fixe) - fii,qij >L2 {v{xe){t) - (li{t)). (9) 

This implies that qij lies in the linear subspace spanned by the {^fixe) — fii), xi € Cj. As a 
consequence, the rank of the operator Ej is finite and is at most = min(nj,r). It therefore 
exists jiiji G M such that: 

ftj = E AiK</'(a^^) - h) (10) 
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leading to: 

< ^{x) - qij >L2= I — — Pijf.Pi{x^xe)^ (11) 
for all J = 1, . . . ,rj. The estimated classification function has therefore the following form: 

2 



X 



A(</'(x)) = — ^ J- ( J- - i I I pi,jiPi{x,xe)\ +\pi{x, 

di 

+ Y log(Aii) + ("^max - di) log(A) - 21og(7ri), 

for all i = 1, . . . , A;. 

ii) Computation of the /3jj£ and Aj^: Replacing ([TO]) in the Fredholm-type equation ([9]) it 
follows that 

— Y Pijeifi^e) - P'i)Pi{xe,xe) = Xij ^ (3iji>{(p{x(>>) - fli). 

Finally, projecting this equation on (p{xm) — Pi for Xm G Cj yields 

— Y l3ije'Pi{xe,Xm)pi{xe,X£>) = Xij ^ l3ije'Pi{xm,x£>). 

* X(,x^i£Ci Xfi^d 

Recalling that Mj is the matrix x rn defined by {Mi)i^£i = pi{xe, xei)/ni and introducing Pij 
the vector of defined by {I3ij)i = Piji, the above equation can be rewritten as M^/3ij = 
XijMiPij or, after simplification Mifiij = XijPij. As a consequence, Ajj is the jth largest 
eigenvalue of Mi and j^ij is the associated eigenvector for all 1 < j < di. Let us note that the 
constraint \\qij\\ = 1 can be rewritten as l^ljl^ij = 1- 

iii) Computation of A: Remarking that trace(Sj) = trace(M() + X]j=r,+i follows: 

and the proposition is proved. □ 
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Proof of Proposition [3] It is sufficient to prove that qij and Ajj are respectively the jth 
normed eigenvector and eigenvalue of Sj. First, 



rii 
1 1 



(3iji{xe - Hi){xi' - iJLi)\xi- Hi 



y TLiXij x^i,xe(^Ci 

{Mi)i^i:f3iji{xi> - fLi) 

y riiXij x^,,xi&Ci 

^ B-^ Y {Mil3ij)e{xe - Hi), 



' TliXij Xgi&d 

and remarking that jiij is eigenvector of Mi, it follows: 



^igij — \j I ^ -B ^ ^ (3ij£i{x£i — Hi) — XijQij. 

'riiXij Xfi£Ci 



Second, straightforward algebra shows that 



niXij ^^g^^ x^,&c^ 



= Y l^ijePiji'i^e - f^if{xii - Hi) 

'^i^ij Xff,Xl£Cr 

= X! {Mi)e/'l3ije(]ije' 

*i Xgi,xi(iCi 

\ 

= -^qljMiqij = 1, 
Xij 

and the result is proved. □ 

Proof of Proposition [4] For all ^ = 1, . . . , L, the £th coordinate of the mapping function 
is defined as the £th coordinate of the function x expressed in the truncated basis 
More specifically, 

:{t) = Y'Piix)beit), 



L 

X( '] 

1=1 



for all t G [0, 1] and thus, for all j = 1, . . . , L, we have 

7j(x)= / x{t)bj{t)dt = YMx) bj{t)be{t)dt = YBj 
Jo ^0 



Bjiipi{x). 
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As a consequence, (p{x) = B ^^{x) and K{x,y) = ^{x^B ^7(2/). Introducing 

it follows that pi{x, y) = {^{x) — jiYB~^{j{y) — ji). Let us first show that qij is eigenvector 
of B~^t,i. Recalling that 

y UiXij xeeCi 

we have 

'riiXij '""^ Xfi,eCi xeeCi 
^ B-^- Y Piji{lM-li){lM-^iYB~\j{xe)-^i) 



\J riiXij xii,xi^c, 



B ^— Yl Pijdlixe) -'ji)piixi,xe) 



1 



B ^ J2 {Mi)i^i,Piji{^{xe)--n) 



\j fliXij Xf,,xeeCi 

^ B-' J2 {MiPij)eij{xe)-7i). 



ifliXij x^/£Ci 

Remarking that Pij is eigenvector of Mj, it follows: 

B'^tiQij = Xij }-— B~^ Y f^ijeilixe) - li) = XijQij. 

TliXij x^iGCi 

Let us finally compute the norm of qij-. 

\\Qij\\^ = ^ E f3ijeil{xe)-iiYB-' ^ (3^Je'{lixe) - ji) 
niAij x^&Ci x^,&c^ 

= Y l^ijil^ijeilixe) - 7i)*5-^(7(x^/) - 7i) 

"»Aij Xi,,xeeCi 

^ij x^i,xt&Ci 

= j-QijMiQij = 1, 
Xij 

and the result is proved. □ 
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